Abstract


In the last decade, the rise of affordable high-throughput sequencing technologies has led to rapid advances 
across the biological sciences. At the time of writing, annotated reference genomes are available within most
clades of eukaryotic pathogens and, including unannotated sequences, over 550 genomes are available in total.
This has greatly facilitated studies in many areas of parasitology. In addition, the volume of functional genomics 
data, including analysis of differential transcription and DNA-protein interactions, has increased exponentially. 
With this unprecedented increase in publicly available data, tools to search and compare datasets are also 
becoming ever more important. A number of database resources are available, and access to these has become 
fundamental for a majority of research groups. This chapter reviews the current state of genomics research for
a number of eukaryotic parasites, describing the genome and functional genomics resources available at the time
of writing and highlighting functionally important or unique aspects of the genome for each group. Publicly
accessible database resources pertaining to eukaryotic parasites are also discussed.


Introduction 


Arguably the field of genomics began when Friedrich Miescher first isolated DNA in 1869 [1], paving the way 
for the work of many scientists in understanding the role of this material in heredity [2], discovering its double- 
helical structure [3] and deciphering the genetic code [4]. However, the technological advance that the entire 
field of genomics rests on is sequencing [5-7]. The ability to read the genetic code is relatively new, having only 
been developed in the last 50 years. Sanger sequencing, which relies on dideoxy chain termination, remained the 
method of choice for several decades; however, early implementations of dideoxy chain termination methods 
were not well parallelized and analysis was initially a painstaking manual process. Later, data analysis was 
carried out computationally, but limited by the processing capacity of computers of the era. These factors 
combined to limit early sequencing to individual genes, small genomic fragments or the genomes of small 
viruses and organelles. The emergence of techniques such as fluorescence-based cycle sequencing and the
polymerase chain reaction, together with the increased use of computational power to automatically read and
analyze results, allowed larger scale genome projects to be undertaken [8]. Indeed, within a few years of this
marriage of techniques and fields, the first bacterial, protozoan, fungal, plant and animal genomes were
sequenced [9-14]. Despite these advances, sequencing of whole genomes remained relatively costly and
time consuming. As an example, sequencing the human genome took roughly 10 years at a price tag of 3 billion 
US dollars (https://www.genome.gov/11006943) [15]. 


The first forays into high-throughput analysis of sequence data came in the form of microarrays. A microarray 
consists of a panel of oligonucleotide probes bonded to a solid surface such as a glass slide. Hybridisation of
nucleic acids from a specimen to individual probes is detected by the intensity of a fluorescent signal. This 
technique was the first to make querying of sequence polymorphisms, transcript expression levels and segmental 
duplications possible on a genomic level, and cheap enough to be widely available. In addition, microarrays 
forced the development of computational tools and techniques to handle data on a genomic scale. However, an 
important limitation of microarrays is the requirement for prior knowledge of the genome and the coincident 
inability to make de novo discoveries (i.e., one can query the presence of known SNPs, but not discover new 
SNPs). A large volume of functional genomics data has been obtained using microarray technologies, but with a 
small number of exceptions (such as diagnostics), microarrays have for the most part been superseded by next- 
generation sequencing technologies. 


Two factors have been instrumental in enabling sequencing to be taken to the next level: continued growth of
computer processing capacity following Moore's law [16] and the development of “next-generation” sequencing
(NGS) methods (also known as second-generation sequencing), which enable massively parallel sequencing of
millions of fragments by synthesis [17, 18]. One of the major advantages of next-generation sequencing is that,
unlike microarrays, it does not require any prior knowledge of the sample and can be applied to a wide variety of
methodologies (readers are directed to an excellent series of manuscripts:
http://www.nature.com/nrg/series/nextgeneration/index.html), including:


* DNA sequencing: High-throughput technology makes sequencing for de novo assembly of new genomes 
ever more affordable. Comparison of resequenced isolates against a reference is a common technique for 
discovery of sequence polymorphisms, while analysis of coverage depth and mapping topology can reveal 
information about structural variations such as chromosomal translocations and segmental duplications. 

* RNA sequencing: Sequencing of RNA can provide important information about gene structure such as the
locations of UTRs and intron/exon boundaries, and the presence of alternative or trans-splice variants.
Analysis of RNAseq coverage depth over a time course or under different experimental conditions reveals
information about transcription of genes under differing conditions, and combination of this technique
with ribosomal profiling enables identification of the translational status of the genome. Specialised
sample preparation techniques enable the sequencing of non-coding RNA species such as those involved
in RNAi-mediated translational silencing.

* Epigenomics: Chromatin immunoprecipitation (ChIP)-sequencing is a powerful technique that allows
determination of the "footprint" of DNA-binding proteins. This can be used to examine promoter-binding
sites, transcription, replication and repair mechanisms and factors such as histone modification that can
affect transcription. Other techniques are available, such as bisulfite sequencing, which enables profiling of
DNA methylation.

* Metagenomics: Sequencing of DNA extracted from samples that contain mixed populations of organisms
can be used to survey populations in environmental samples (such as soil) or biological samples (such as 
gut microbiomes). Metagenomics techniques can be used to determine the makeup of populations and to 
survey how this changes over time or under different conditions. Metagenomics analysis is a fast-growing 
field in which the problems of analysis have not yet been solved. 
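
The coverage-depth idea mentioned under DNA sequencing can be illustrated with a minimal sketch. It assumes that
per-base read depths have already been obtained by mapping reads to a reference. The function name, the 1,000 bp
window and the 1.5x and 0.5x thresholds are illustrative assumptions, not values taken from any study cited here,
and real analyses would usually also correct for GC content and mappability.

```python
from statistics import median


def flag_depth_outliers(depths, window=1000, gain=1.5, loss=0.5):
    """Return (start, label, ratio) for windows with unusual mean depth.

    depths: per-base read depth along one reference sequence.
    window, gain, loss: hypothetical settings chosen for illustration.
    """
    if not depths:
        return []

    # Mean depth in consecutive, non-overlapping windows.
    means = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        means.append((start, sum(chunk) / len(chunk)))

    # Use the median window depth as the single-copy baseline.
    baseline = median(m for _, m in means)
    if baseline == 0:
        return []

    calls = []
    for start, mean_depth in means:
        ratio = mean_depth / baseline
        if ratio >= gain:
            calls.append((start, "possible duplication", ratio))
        elif ratio <= loss:
            calls.append((start, "possible deletion", ratio))
    return calls


# Toy example: one 1 kb window at twice the background depth.
toy_depths = [30] * 3000 + [60] * 1000 + [30] * 2000
print(flag_depth_outliers(toy_depths))
```

Using the median rather than the mean as the baseline keeps a few highly amplified regions (for example rDNA
arrays) from distorting the estimate of single-copy depth.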


It is not surprising that the dawn of large scale sequencing projects necessitated an expansion in the field of 
bioinformatics and data management. As high-throughput sequencing has become cheaper it has moved from 
being a specialist technique to a tool used daily in labs across the world. This has necessitated the development of 
user-friendly tools that can run on desktop machines, and thrust the field of bioinformatics into the foreground. 
The expansion of massively parallel sequencing has also led to a revolution in the teaching of biology, with 
computational techniques for management and analysis of genomic-scale datasets now being taught in many 
undergraduate courses. Data warehousing is also becoming a priority, with data repositories such as the National 
Center for Biotechnology Information (NCBI) [19] having to rethink both their submissions procedures and 
their approaches to storage. 


Parasite genomics 


The field of parasite genomics has benefited tremendously from the sequencing revolution. While only a handful 
of parasite genomes had been sequenced by 2005, the number had exploded to over 550 genomes
(http://genomesonline.org) [20] by 2015. This number reflects both annotated and unannotated genomes and will
already be out of date by the time this chapter is in print. Besides the technological advances, this increase in 
sequences has been aided by a number of initiatives with parasitology components. These include projects 
supported by the Wellcome Trust Sanger Institute in the United Kingdom and a number of parasite-specific
genome sequencing white papers supported by the National Institute of Allergy and Infectious Diseases (NIAID)
Genomic Centers for Infectious Diseases (GCID) in the United States. Together these centers have generated
sequences, assemblies and annotation for many important human and veterinary parasites. All data from these
projects are available via project-specific websites (e.g., GeneDB: http://genedb.org) [21] and/or through the
International Nucleotide Sequence Databases (GenBank, EMBL Nucleotide Sequence Database, and the DNA 
Data Bank of Japan [22-24]). 


General features of protozoan parasite genomes 


Amoebae 


The amoebae, single celled eukaryotes that shared a most recent common ancestor with humans after plants but 
before fungi, are from a sparsely sampled and little studied domain of the tree of life. As with most protists, the
best known are those that cause disease in humans, which among the amoebae are the Entamoebae and the
Acanthamoebae. The Entamoebae are intestinal parasites or commensals of a wide range of animals in addition to
humans. The Acanthamoebae are free-living amoebae of interest to humans primarily as opportunistic
pathogens. These two groups, together with the social amoebae such as Dictyostelium species, are the best studied
amoebae and those for which sequenced genome assemblies exist [25-27].


Entamoebae 


The described species of Entamoeba are generally obligate parasites or commensals. They have simple life cycles 
consisting of a vegetative stage, the trophozoite, which lives in the host's large intestine and feeds upon bacteria,
and a transmissible stage, the cyst, which allows survival outside the host and transmission to a new host.
Possible exceptions to these rules include two species (Entamoeba moshkovskii and Entamoeba bangladeshi) that
can survive outside of the host and may be primarily free-living organisms, and one species (Entamoeba
gingivalis) that colonises the mouth and may have lost the ability to form cysts, instead being transmitted directly
in the trophozoite form.


The human pathogen Entamoeba histolytica is the most studied species of the genus. A draft genome assembly 
was first published in 2005, with subsequent updates, though it remains fragmented and chromosomes cannot 
be defined [27-29]. Distinctive features of the E. histolytica genome include an unusual organisation of tRNA
genes, which occur in arrays of sets of tRNA genes separated by repetitive intergenic DNA [30], and rRNA genes 
encoded on extrachromosomal circular DNA occurring in multiple copies per cell [31]. Two features of 
Entamoebae associated with their anaerobic environments are the loss of the function and genome of the 
mitochondrion, which occurs as a relict organelle, the mitosome, and the related lateral transfer of genes, many 
involved in anaerobic metabolic processes and apparently derived from anaerobic bacteria [32]. 


Genomic re-sequencing suggests little nucleotide diversity among E. histolytica, even among lineages derived 
from widely separated geographical locations [33]. In contrast, gene copy number variation appears to be 
extensive [33], which may be associated with the genomic plasticity observed among E. histolytica lineages [34]. 
Studies using tRNA repetitive intergenic DNA or SNP markers also suggest very little linkage disequilibrium 
among markers, which suggests extensive outcrossing among parasite lineages [33, 35, 36]. Genetic diversity in 
other Entamoeba species is largely unknown, apart from studies of the 18S ribosomal RNA gene, which indicate 
that some 'species' may in fact be species complexes [37].


Genomic data exist for four other species of Entamoeba: E. nuttalli, E. dispar, E. moshkovskii and E. invadens. For 
the first three of these, the data are available but no reference publication yet exists. Most closely related to E. 
histolytica, Entamoeba nuttalli is a pathogen of macaques [38-40]. Entamoeba dispar infects humans and is of 
primary interest as a relative of E. histolytica (only recently defined as a separate species) that appears to be non- 
pathogenic [41]. Entamoeba moshkovskii is of uncertain status as a parasite or a free-living organism and has 
recently been associated with disease in humans [42, 43]. Entamoeba invadens, a pathogen of reptiles, is of 
primary interest as a model species for the process of encystation (which cannot be induced in axenic E. 
histolytica cultures). The genome of E. invadens is considerably larger than that of E. histolytica [44]. Genomic 
data for a number of E. histolytica strains, from a range of geographical locations and associated with different 
disease manifestations, are available via AmoebaDB (http://AmoebaDB.org) [45] (Table 1.1).


Table 1.1. Genome datasets of amoebae available in AmoebaDB. 


| Species | Strain | Dataset | Sequencing platform | Reference |
|---|---|---|---|---|
| Entamoeba histolytica | HM-1:IMSS | De novo genome assembly | Sanger | [27, 28] |
| Entamoeba histolytica | HM-1:IMSS-A | De novo genome assembly | 454, Illumina | |
| Entamoeba histolytica | HM-1:IMSS-B | De novo genome assembly | 454, Illumina | |
| Entamoeba histolytica | HM-1:CA | De novo genome assembly | 454, Illumina | |
| Entamoeba histolytica | HM-3:IMSS | De novo genome assembly | 454, Illumina | |
| Entamoeba histolytica | KU27 | De novo genome assembly | 454, Illumina | |
| Entamoeba histolytica | KU48 | De novo genome assembly | 454, Illumina | |
| Entamoeba histolytica | KU50 | De novo genome assembly | 454, Illumina | |
| Entamoeba histolytica | MS96-3382 | De novo genome assembly | 454, Illumina | |
| Entamoeba histolytica | DS4-868 | De novo genome assembly | 454, Illumina | |
| Entamoeba histolytica | Rahman | De novo genome assembly | 454 | |
| Entamoeba histolytica | HM-1:IMSS-A | Re-sequencing | SOLiD | [33] |
| Entamoeba histolytica | HM-1:IMSS-B | Re-sequencing | SOLiD | [33] |
| Entamoeba histolytica | Rahman | Re-sequencing | SOLiD | [33] |
| Entamoeba histolytica | 2592100 | Re-sequencing | SOLiD | [33] |
| Entamoeba histolytica | MS84-1373 | Re-sequencing | SOLiD | [33] |
| Entamoeba histolytica | MS27-5030 | Re-sequencing | SOLiD | [33] |
| Entamoeba histolytica | PVBM08B | Re-sequencing | SOLiD | [33] |
| Entamoeba histolytica | PVBM08F | Re-sequencing | SOLiD | [33] |
| Entamoeba histolytica | HK-9 | Re-sequencing | SOLiD | [33] |
| Entamoeba histolytica | IULA:1092:1 | Re-sequencing | SOLiD | [33] |
| Entamoeba nuttalli | P19 | De novo genome assembly | Illumina | |
| Entamoeba dispar | SAW760 | De novo genome assembly | Sanger | |
| Entamoeba moshkovskii | Laredo | De novo genome assembly | 454 | |
| Entamoeba invadens | IP1 | De novo genome assembly | Sanger | [44] |
| Acanthamoeba castellanii | Neff | De novo genome assembly | Sanger, 454, Illumina | [25] |
| Acanthamoeba castellanii | Ma | De novo genome assembly | Illumina | |
| Acanthamoeba mauritaniensis | 1652 | De novo genome assembly | Illumina | |
| Acanthamoeba quina | Vil3 | De novo genome assembly | Illumina | |
| Acanthamoeba astronyxis | | De novo genome assembly | Illumina | |
| Acanthamoeba palestinensis | | De novo genome assembly | Illumina | |
| Acanthamoeba sp (T4b-type) | | De novo genome assembly | Illumina | |
| Acanthamoeba triangularis | SH621 | De novo genome assembly | Illumina | |
| Acanthamoeba sp | Incertae sedis | De novo genome assembly | Illumina | |
| Acanthamoeba sp | Galka | De novo genome assembly | Illumina | |
| Acanthamoeba lugdunensis | L3a | De novo genome assembly | Illumina | |
| Acanthamoeba culbertsoni | A1 | De novo genome assembly | Illumina | |
| Acanthamoeba rhysodes | Singh | De novo genome assembly | Illumina | |
| Acanthamoeba lenticulata | PD2S | De novo genome assembly | Illumina | |


Acanthamoebae 


The Acanthamoebae are of importance for human health as a cause of keratitis when they infect the eye, often via
contaminated contact lenses [46]. More usually, they are free-living, soil-dwelling pathogens of bacteria.
A draft genome assembly of Acanthamoeba castellanii was published in 2013 [25]. The genome encodes large
families of genes involved in cell signalling and environmental sensing, such as protein kinases [25]. As in the
Entamoebae, a proportion of genes appear to have been acquired by lateral gene transfer, though the number of
such genes in A. castellanii is larger and a larger proportion appear to have been acquired from aerobic and
free-living bacteria [25]. Interestingly, in contrast to Entamoeba genes, which contain few introns, Acanthamoeba
genes are intron-rich [25]. Thirteen additional Acanthamoeba species genome sequence assemblies, representing 
a geographically diverse range of species and strains, were recently made available via AmoebaDB (Dr. Andrew 
Jackson, University of Liverpool; Table 1.1). 


Giardia 


Giardia intestinalis, also known as Giardia duodenalis or Giardia lamblia, is a unicellular protozoan parasite that 
infects the upper intestinal tract of humans and animals [47]. The disease, giardiasis, manifests in humans as
acute diarrhea that can become chronic, but the majority of infections remain asymptomatic [47].
Giardiasis has a global distribution, with 280 million cases reported annually and a more pronounced impact in
the developing world.


G. intestinalis is divided into eight morphologically identical genotypes or assemblages (A to H). Only 
assemblages A and B have been associated with human infections and they are further divided into
sub-assemblages: AI, AII, AIII, BIII and BIV [48]. Despite extensive efforts to associate specific assemblages with
symptoms, conflicting results have been obtained and there is to date no clear correlation between assemblage 
and symptoms. 


Giardia, like the other diplomonads, has two nuclei and each nucleus is diploid, resulting in a tetraploid genome 
[49]. G. intestinalis has 5 different linear chromosomes with TAGGG repeats [50]. The study of the genome 
structure and architecture in Giardia using pulsed-field gel electrophoresis (PFGE) revealed differences in size of 
individual chromosomes within and between G. intestinalis isolates [51]. The size differences were attributed to 
frequently recombining telomeric regions and differences in copy number of rDNA arrays [50]. Aneuploidy has
been suggested in individual Giardia cells based on cytogenetic evidence [52], with the most common karyotype
differing between assemblage A and B isolates.


The genomes of six G. intestinalis isolates, representing three different assemblages (A, B and E), are available to 
date [53-56]. The first genome to be sequenced was WB-C6 (assemblage AI), which has a haploid size of ~11.7
Mb distributed over the five chromosomes [55]. The compact genome contains few introns, and promoters are
short and AT-rich. A total of 6470 open reading frames (ORFs) were identified, but only 4787 were later shown to
be associated with transcription [57]. Genes are located on both DNA strands and sometimes even overlap.
Reduction of components in metabolic pathways, DNA replication and transcription was also detected. Several
genes appear to be of bacterial origin and are candidates for lateral gene transfer [55]. Variant-specific surface proteins (VSPs) are
involved in antigenic variation in Giardia and later analyses have shown that there are 186 unique VSP genes in 
the WB genome [53]. Chromosome-wide maps have been established by optical mapping of the WB genome 
[58]. The results resolved some misassemblies in the genome and indicated that the actual genome size of the 
WB isolate is 12.1 Mb, in close agreement with PFGE analyses. The major discrepancy was an underestimation 
of the size of chromosome 5, the largest of the Giardia chromosomes. Chromosome 5 contained an 819 kbp gap 
in the optical map, most likely rDNA repeats [58]. 


Shortly after publication of the WB genome, the genome of the GS isolate (assemblage B) was sequenced using
454 technology [59]. However, the assembly was highly fragmented, with 2931 contigs. In total, 4470 ORFs were
identified, and the two genomes show 78% amino acid identity in protein coding regions. The repertoire of VSP
genes was very different from that of the WB isolate, but only 14 VSP genes were complete. The GS genome was
later re-sequenced, resulting in 544 contigs and a much more complete repertoire of VSPs (275 [55]). Moreover,
the GS genome had a much higher level of allelic sequence heterozygosity (ASH) than WB (0.5% versus 0.01%).
ASH was distributed unevenly, in regions of low and high ASH, over the GS genomic contigs [59].


The third genome represents the first non-human isolate to be sequenced. The P15 isolate originates from a 
symptomatic pig (piglet no. 15) and belongs to assemblage E [54]. Assemblage E has been found to be more 
closely related to assemblage A than to assemblage B [48] and the identity of protein coding sequences was 90%
between P15 and WB and 81% between P15 and GS [54], consistent with earlier results. Obtaining the sequence
of three phylogenetically distinct Giardia groups (WB, P15 and GS) made it possible to assign lineage specificity
to the genes identified in the three genomes. In total, 91% of the genes (~4500 protein encoding genes) were found
to be present in all three Giardia genomes (three-way orthologs) and 9% of genes are variable, most of which are
members of four large gene families (the Variant-specific Surface Proteins (VSP), NEK Kinases, Protein 21.1 and
High Cysteine Membrane Proteins (HCMP)). The highest number of isolate-specific genes (38) was found in the
P15 isolate, followed by GS (31) and WB (5). The P15 and GS isolates shared 20 proteins to the exclusion of WB,
with 13 of these found in a cluster of 20 kbp in the P15 genome [54]. Interestingly, the ORFs in this genomic
cluster are not expressed in any of the conditions tested. The chromosomal architecture in Giardia shows core
gene-rich stable regions with maintained gene order interspersed with non-syntenic regions harboring VSPs and
other non-core genes. These regions often have a higher GC% and show nucleotide signatures that deviate from
surrounding regions, in part due to the common occurrence of VSP and high-cysteine membrane protein
(HCMP) genes that are more GC-rich than the genome on average. The level of ASH in the P15 isolate was
0.0023%, lower than in the GS isolate [54].


Three assemblage AII isolates have been sequenced (DH1, AS98 and AS175) [53, 56]. The amount of genetic
diversity was characterized in relation to the genome of WB, the assemblage A reference genome. The analyses
showed that the divergence between AI and AII is approximately 1%, represented by ~100,000 single nucleotide
polymorphisms (SNPs) distributed over the chromosomes with enrichment in the variable genomic regions
containing VSPs and HCMPs [56]. The level of ASH in two of the AII isolates (AS98 and AS175) was found to be
0.25-0.35%, which is 25-30 fold higher than in the WB isolate and 10 fold higher than the assemblage AII
isolate DH1 (0.037% [56]).
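
As a rough consistency check on the divergence figure, the SNP count can be converted to a per-site percentage.
The sketch below assumes the ~11.7 Mb haploid WB genome size quoted earlier as the denominator, which is a
simplification because not every position can be genotyped; the function name is hypothetical.

```python
def percent_divergence(snp_count, genome_length_bp):
    """Percentage of positions that differ, given a SNP count and genome length."""
    return 100.0 * snp_count / genome_length_bp


# Approximate values taken from the text; both are rounded estimates.
wb_haploid_bp = 11_700_000
approx_snps = 100_000

print(round(percent_divergence(approx_snps, wb_haploid_bp), 2))  # 0.85
```

The result, a little under 1%, agrees with the approximate divergence reported between sub-assemblages AI and AII.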


There is a need for further genomic analyses of Giardia genomes. The assemblage A (WB) and B (GS) reference 
genomes can be improved, which will facilitate reference-based genome mapping of data from clinical and 
environmental isolates. More isolates from the A and B assemblages should be sequenced so that all the genetic 
differences between the human infecting isolates can be identified. Genomic information from the remaining 
assemblages (C-D and F-H) can reveal species-specific genomic features. Sequence data from other Giardia species
such as Giardia muris will be important for further studies of the evolution of Giardia biology and virulence. In
addition to the underlying genomic sequence and annotation, a number of functional datasets are available in
GiardiaDB.


Cryptosporidium 


Cryptosporidium are protozoan parasites with significant impact on the health of humans and livestock. They
infect the intestinal and gastric epithelium of a variety of vertebrates, causing a disease known as 
cryptosporidiosis. Human cryptosporidiosis is responsible for diarrhea-induced death of young children in 
developing countries, and in immune-compromised adults it constitutes an acute, usually self-limiting, diarrheal 
illness that results in significant morbidity and sometimes death. A recent study found Cryptosporidium to be the 
second leading cause of moderate-to-severe diarrhea in developing countries, and diarrheal diseases to be the 
second leading cause of death among children under five globally [60]. 


There are no licensed vaccines against Cryptosporidium and the only FDA-approved drug (Nitazoxanide) is only 
effective in immunocompetent patients. Thus, the development of alternative therapeutic agents and vaccines 
against this disease is urgently required, and remains a high public health priority. The lack of a practical and
reproducible axenic in vitro culture system for Cryptosporidium is a major limitation to the development of
specific anti-cryptosporidial vaccines [61, 62]. Advances in next-generation sequencing technologies and in 
genome assembly and annotation methodologies [63-66] have facilitated the generation of -omics data for 
Cryptosporidium, with genomics resources now available for multiple Cryptosporidium species (Table 1.2, [67]). 
These developments prompted a shift to in silico studies aiming to identify a wide pool of potential vaccine 
targets, to be further filtered according to properties common to antigens [68]. This approach is similar to 
reverse vaccinology studies that have led to licensed vaccines in other organisms [69, 70], and is particularly 
promising in organisms that, like Cryptosporidium, are difficult to cultivate continuously in the laboratory. 


Apart from humans, Cryptosporidium species infect other vertebrates including fish, birds and rodents, and some
species are capable of zoonotic transmission [71, 72]. Some have a somewhat restricted host range, such as
Cryptosporidium hominis, a human parasite that infects the small intestine, Cryptosporidium muris, a gastric
parasite of rodents, and Cryptosporidium baileyi, an avian parasite. Cryptosporidium parvum and
Cryptosporidium meleagridis have a wider host range and are known to infect both avian and mammalian
species, including humans. C. parvum and C. hominis are considered class B agents of bioterrorism and are
significant causes of gastrointestinal infections worldwide.


Table 1.2. Cryptosporidium species with completed or draft genomes. 


| Species | Number of draft genomes | Natural host range | Predilection site |
|---|---|---|---|
| C. hominis | 8 | Human, primates | Intestinal |
| C. parvum | 8 | Human, Bovine | Intestinal |
| C. meleagridis | 1 | Various vertebrates | Intestinal |
| C. baileyi | 1 | Birds | Respiratory |
| C. muris | 1 | Rodents | Gastric |
| C. sp. chipmunk LX-2015 | 1 | Rodents, Human | Intestinal |


Cryptosporidium genomic resources 


Cryptosporidium genomes are compact, with >75% consisting of protein-coding sequences, have an average size
of approximately 8.5 to 9.5 megabase pairs (Mbp), and each encode ~4000 genes (Table 1.2). C. parvum (isolate
IOWA II) was the first species for which a genome was published [73]. The genome was found to be 9.1 Mbp in
length, assembled into thirteen supercontigs. Pulsed-field gel electrophoresis studies had shown the
nuclear-encoded genome to consist of 8 chromosomes, and therefore the assembly includes five unresolved gaps.
About 5% of the 3,807 predicted protein-coding genes in this assembly contained introns, and the average gene
length was 1,795 base pairs (bp). At about the same time the genome of C. hominis (isolate TU502) was published
[74]. Since the two species were known to be closely related, with about 95-97% DNA sequence identity between
them, the C. hominis genome was sequenced to a much lower depth of coverage. The primary goal was to
identify differences relative to C. parvum, rather than reconstruct a gold-standard genome assembly. 
Consequently, this assembly is much more fragmented, with the likely 8 chromosomes split among 1,413 contigs, 
which are grouped into ~240 scaffolds. 


There were some fundamental differences between the annotated gene sets in the two species. The average gene 
length of C. hominis was 1,360 bp, about 500 bp less than that of C. parvum, and about 5-20% of the C. hominis
genes were predicted to contain introns, compared to 5% in C. parvum [73, 75]. In addition, only 60% of the C.
hominis genome was estimated to be coding compared to 75% for C. parvum. These differences are remarkable 
for such closely related taxa and were thought to be due to erroneous gene models in C. hominis due to the high 
degree of genome fragmentation. To address these questions, the C. hominis genome has recently been
re-sequenced, re-assembled and re-annotated, improving the assembly from draft to "nearly finished" form, with
preliminary data available in CryptoDB.org. This effort increased the average gene length by 500 bp, bringing it
to 1,845 bp, in line with gene length in C. parvum (Table 1.2). The improved genome assembly consists of only
120 contigs, a ten-fold reduction in contig number relative to the original C. hominis assembly. The genome
assembly is more comprehensive, with an additional 370 kb of sequence, and is now comparable in length to that
of C. parvum. Finally, there was a 25% increase in the predicted fraction of the genome that encodes proteins.
The now marked similarities between the re-annotated C. hominis gene set and that of C. parvum provide 
encouraging evidence that the predicted genes are a significant improvement over the original annotation, but 
validation of gene structures awaits community effort. C. parvum IOWA II was also recently re-annotated, based 
on full-length cDNA clone sequences and RNA-Seq data (Table 1.2, [76]). 
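
The improvements described for the C. hominis TU502 assembly can be checked against the values reported in the
genome statistics table for the 2004 and 2014 assemblies. The sketch below is only arithmetic on those published
figures; the dictionary keys and the function name are hypothetical.

```python
# Values as reported for the two C. hominis TU502 assemblies.
tu502_2004 = {"length_bp": 8_743_570, "contigs": 1413, "percent_coding": 60.4}
tu502_2014 = {"length_bp": 9_110_085, "contigs": 120, "percent_coding": 75.8}


def compare_assemblies(old, new):
    """Summarise how a newer assembly differs from an older one."""
    return {
        # Extra sequence captured by the newer assembly.
        "added_bp": new["length_bp"] - old["length_bp"],
        # How many times fewer contigs the newer assembly has.
        "contig_fold_reduction": old["contigs"] / new["contigs"],
        # Relative (not absolute) change in the coding fraction.
        "relative_coding_increase_pct": 100.0
        * (new["percent_coding"] - old["percent_coding"])
        / old["percent_coding"],
    }


summary = compare_assemblies(tu502_2004, tu502_2014)
for name, value in summary.items():
    print(name, round(value, 1))
```

On these figures the added sequence is about 367 kb and the contig count falls roughly twelve-fold. The quoted 25%
increase in coding fraction corresponds to a relative change; in absolute terms it is about 15 percentage points.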


Both C. hominis and C. parvum are intestinal parasites. C. muris (isolate RN66), the third species sequenced, was 
chosen for two primary reasons: its evolutionary distance to C. hominis and C. parvum, and the fact that it is a 
gastric species, which is rare among Cryptosporidium parasites. Currently, the field is rapidly expanding, with the 
genome sequence for several isolates of C. parvum and of C. hominis now available, as well as the genomes of 
other species (Table 1.3). 


The availability of multiple isolate genomes per species allows analyses that can shed light into species evolution, 
including age and population structure, and will facilitate studies that address key questions of great 
translational impact, including the amino acid sequence variations in current candidate vaccine antigens, and 
the identification of genomic correlates of virulence whenever isolates with different pathogenic potential are 
available. In an effort to support research that addresses key questions in the evolution of the Cryptosporidium
genus, and the discovery of parasite-encoded factors that control host specificity, C. meleagridis UKMEL1 was
sequenced. This species appears to lack host specificity and is considerably more distantly related to C.
hominis and C. parvum than they are to each other, but is a closer relative to them than is C. muris. C. baileyi can
complete its life cycle in embryonated chicken eggs, which is of critical importance for the establishment of an
avian model system of cryptosporidiosis, and C. baileyi TAMU-09Q1 was sequenced to support the development of
such a system. Determining the proportion of Cryptosporidium infections that are caused by human-specific
parasites rather than by zoonotic infections remains a critical question in the field. Accordingly, the genome of a
Cryptosporidium species of chipmunk origin, isolated from a zoonotic infection, was sequenced with the goal of
identifying genotyping markers that differentiate among Cryptosporidium subtypes [77].


A major challenge for the generation of Cryptosporidium whole genome sequence data has been the need to 
propagate the parasites in vertebrate hosts, a step needed to generate DNA material in sufficient quantity and of 
the quality needed for use in high-throughput sequencing applications. A novel method for preparing genomic 
Cryptosporidium DNA directly from human stool samples that satisfies the criteria of these applications has now 
been developed [78]. The authors used this approach to generate five assemblies each for C. parvum and C. 
hominis. Finally, a new C. hominis isolate (UdeA01), also isolated from human stool, has been sequenced 
independently [76]. 


All the genomics data described above is publicly available through CryptoDB [67]. This database also provides 
a platform to easily query the annotation and a variety of pre-computed analysis data (including homology 
information across taxa). Multiple aspects of the data can be easily visualized, including synteny, polymorphism 
and expression data. CryptoDB also contains Cryptosporidium information other than genome sequences, 
including gene expression and proteomics data (Table 1.4). 
Table 1.2. Genome statistics for representative Cryptosporidium species. 


Species         Isolate           GenBank accession  Assembly length (bp)  No. contigs  Largest contig (bp)  No. coding genes  Mean coding length (bp)  Percent coding
C. hominis      TU502 (2004)      AAEL00000000       8,743,570             1413         90,444               3,886             1,360                    60.4%
C. hominis      TU502_new (2014)  SUB482083          9,110,085             120          1,270,815            3,745             1,845                    75.8%
C. parvum       Iowa II           AAEE00000000       9,103,320             13           1,278,458            3,807             1,795                    75.3%
C. parvum (a)   Iowa II           AAEE00000000       9,103,320             13           1,278,458            3,865             1,783                    75.7%
C. meleagridis  UKMEL1            SUB482042          8,973,224             57           732,862              4,326             1,861                    89.7%
C. baileyi      TAMU-09Q1         SUB482078          8,502,994             153          702,637              3,700             1,776                    77.3%
C. muris        RN66              AAZY02000000       9,245,250             84           1,324,930            3,934             1,780                    79.2%

(a) 2015 re-annotation 
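
The relationship between the last three columns of Table 1.2 can be checked with a little arithmetic. The sketch below assumes that the "percent coding" column is approximately the number of coding genes multiplied by the mean coding length, divided by the assembly length; this reading of the column headers is an assumption, and the function name is illustrative only. It reproduces the reported value for most rows to within rounding, while some rows (for example C. muris) differ by a few percentage points, so it should be taken as an approximation rather than the exact method used to build the table.

```python
# Rows copied from Table 1.2:
# (label, assembly length in bp, no. coding genes, mean coding length in bp, reported percent coding)
ROWS = [
    ("C. hominis TU502 (2004)", 8_743_570, 3_886, 1_360, 60.4),
    ("C. hominis TU502_new (2014)", 9_110_085, 3_745, 1_845, 75.8),
    ("C. meleagridis UKMEL1", 8_973_224, 4_326, 1_861, 89.7),
    ("C. baileyi TAMU-09Q1", 8_502_994, 3_700, 1_776, 77.3),
    ("C. muris RN66", 9_245_250, 3_934, 1_780, 79.2),
]


def percent_coding(assembly_bp, n_genes, mean_coding_bp):
    """Estimate the coding fraction of an assembly, as a percentage.

    Assumes total coding sequence = number of genes x mean coding length.
    """
    return 100.0 * n_genes * mean_coding_bp / assembly_bp


if __name__ == "__main__":
    for label, assembly_bp, n_genes, mean_bp, reported in ROWS:
        estimate = percent_coding(assembly_bp, n_genes, mean_bp)
        print(f"{label}: estimated {estimate:.1f}%, reported {reported:.1f}%, "
              f"difference {estimate - reported:+.1f}")
```
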
Table 1.3. Cryptosporidium genomes available in CryptoDB. 


Species          Isolate     Year  Sequencing institution (a)  GenBank accession  RNA-Seq SRA accession  Assembly length (bp)  No. contigs  Largest contig (bp)
C. hominis       TU502       2004  VCU                         AAEL01             -                      8,743,570             1422         90,444
                             2013                              AAEL02             -                      8,915,516             358          282,140
C. hominis       TU502_new   2014  IGS/Tufts                   SUB482083          SRS566230              9,110,085             120          1,270,815
C. hominis       37999       2014  CDC                         JRXJ01             -                      9,054,010             78           1,029,232
C. hominis       UKH1        2014  IGS/Tufts                   SUB482088          SRS566214              9,141,398             156          542,781
C. hominis       UKH3        2015  PHW                         LJRW01             -                      9,136,308             34           1,295,005
C. hominis       UKH4        2015  PHW                         LKHI01             -                      9,158,280             18           1,295,931
C. hominis       UKH5        2015  PHW                         LKHJ01             -                      9,179,731             18           1,281,265
C. parvum        Iowa II     2004  Univ. Minnesota             AAEE01             -                      9,087,724             18           1,278,458
C. parvum        UKP2        2015  PHW                         LKHK01             -                      9,126,082             18           1,285,807
C. parvum        UKP3        2015  PHW                         LKHL01             -                      9,085,686             18           1,258,884
C. parvum        UKP4        2015  PHW                         LKHM01             -                      9,001,535             18           1,283,549
C. parvum        UKP5        2015  PHW                         LKHN01             -                      9,283,240             18           1,284,088
C. parvum        UKP6        2015  PHW                         LKCK01             -                      9,112,937             18           1,296,567
C. parvum        UKP7        2015  PHW                         LKCL01             -                      9,221,024             18           1,295,191
C. parvum        UKP8        2015  PHW                         LKCJ01             -                      9,203,314             18           1,288,507
C. meleagridis   UKMEL1      2014  IGS                         SUB482042          -                      8,973,224             57           732,862
C. baileyi       TAMU-09Q1   2014  Texas A&M                   SUB482078          SRS566232              8,502,994             153          702,637
C. muris         RN66        2008  TIGR                        AAZY02             SRS000463              9,238,736             97           1,182,920
C. sp. chipmunk  LX-2015     2015  CDC                         JXRN01             -                      9,509,783             853          478,353

(a) CDC: Division of Foodborne, Waterborne, and Environmental Diseases, Centers for Disease Control and Prevention; IGS: Institute 
for Genome Sciences; PHW: Public Health Wales (Microbiology); TIGR: The Institute for Genomic Research; VCU: Virginia 
Commonwealth University. 


Table 1.4. Other Cryptosporidium genomic resources available in CryptoDB. 


Data type          Description                                             Species                                  Reference
EST                EST library and predicted full length cDNA              C. parvum HNJ-1                          [79]
EST                ESTs from Database of Expressed Sequence Tags (dbEST)   C. baileyi TAMU-09Q1, C. hominis TU502,  [80]
                                                                           C. meleagridis UKMEL1, C. muris RN66,
                                                                           C. parvum Iowa II
RT-PCR             Expression profiling of life cycle stages               C. parvum Iowa II                        [81]
                   post-infection
Microarray         Global gene expression in oocysts (environmental        C. parvum Iowa II                        [82]
                   stage) and oocysts treated with UV
RNA-Seq            Transcriptome of sporozoites and HCT-8 infection        C. parvum Iowa II                        (Lippuner et al.)
                   time course
RNA-Seq            Transcriptome in normal culture conditions              Chromera velia CCMP2878,                 [83]
                                                                           Vitrella brassicaformis CCMP3155
Mass Spectrometry  Enriched cytoskeletal and membrane fractions            C. parvum Iowa II                        [84]
Mass Spectrometry  Mitochondrial fraction proteomics                       C. parvum Iowa II                        (Putignani et al.)
Mass Spectrometry  Proteome of intact oocyst, oocyst wall and              C. parvum Iowa II                        [85]
                   sporozoites by linear ion trap MS
Mass Spectrometry  Proteome during sporozoite excystation                  C. parvum ISSC162                        [86]
Mass Spectrometry  Sporozoite peptides from 2D gel LC-MS/MS analysis       C. parvum Iowa II                        [87]
SNPs               SNPs determined by aligning high throughput             C. parvum TU114, C. parvum Iowa II       [75]
                   sequencing reads of C. parvum TU114 to the
                   C. parvum reference genome


Piroplasms 


Piroplasms are a vast group of poorly characterized Haemosporidia that are named after their pyriform (pear- 
shaped) structure visible during intracellular stages in the host erythrocytes. They are found in numerous 
mammals, birds, and reptiles, and are often transmitted by ixodid ticks after parasite replication in the tick gut 
[88]. While little is known about the life cycle of most piroplasms, well-described species of Theileria commonly 
infect mammalian host leukocytes, followed by a tick-infective stage in red blood cells (RBCs), while Babesia do 
not have a leukocyte-infective stage [89, 90]. Some Babesia species are known to infect humans (B. microti, B. 
divergens, B. duncani), where they cause a malaria-like disease [89]. The diseases caused by these parasites can 
lead to fevers and even death in equid and ruminant livestock species, all around the world. Consequently, most 
of the genomics resources developed for piroplasm research to date have focused on species that infect bovids 
(Table 1.5). Most of these resources are available through PiroplasmaDB (http://PiroplasmaDB.com). 


The first piroplasm genomes were published in 2005, and consisted of Theileria species of domestic cattle and 
wild buffalo. T. parva causes a tremendous economic impact in eastern, central and southern Africa [90], while 
T. annulata is distributed throughout much of southern Asia and southeast Europe [94]. Their genomes are small 
at ~8.3 Mbp in length, are AT-rich with a GC-content of 33%, and contain ~4,000 nuclear protein-coding genes. 
These properties are similar to those of other Piroplasmida genomes that have been sequenced since (Table 1.6). 
Several genomic features were uncovered that are typical of other sequenced piroplasm genomes, such as the 
presence of telomeric multi-gene families, and several incomplete or absent biosynthetic pathways, implying a 
critical dependence on salvaging resources from their hosts [97]. These two piroplasms are unique, however, in 
their ability to transform host leukocytes to have a cancer-like phenotype. This phenotype correlates with the 
expansion of two multi-gene families: the Subtelomere-encoded Variable Secreted Protein (SVSP) gene family 
and the T. annulata schizont AT-hook/T. parva Host Nucleus (TashAT/TpHN) gene family [98, 99]. Two other 
Theileria species have been sequenced, T. orientalis [98], an economically important pathogen of cattle in eastern 
Asia, and T. equi [95], which has a worldwide distribution and infects equids. These two genomes have many 
similar features, with the exception that the genome of T. equi is larger, mostly due to a significant increase in the 
number of species-specific genes, including antigen-encoding families such as the Equi Merozoite Antigen 
(EMA) family [95]. 


With a genome size of ~8.2 Mbp, the B. bovis genome sequence revealed a genomic organization that is 
remarkably similar to that of T. parva, with extensive synteny and multiple, large multi-gene families potentially 
contributing to host immune evasion [106]. However, the smallest apicomplexan genome sequenced to date is that of B. 
microti, the principal agent of human babesiosis and a common pathogen transmitted by blood transfusions [89, 
101]. With a genome size of 6.5 Mbp, B. microti is the closest known natural approximation of an 
apicomplexan “core genome”, and comparative genomics with this reduced genome could yield insights into the 
most essential gene products of apicomplexans, which could make excellent chemotherapeutic targets. B. microti is 
also the only example of an apicomplexan with a circular mitochondrial genome [101]. 


One apicomplexan with somewhat unclear phylogenetic position is Cytauxzoon felis. While originally 
considered a separate genus, the existence of exo-erythrocytic forms, particularly schizonts, in macrophages/ 
monocytes indicates that this parasite might be more appropriately considered in the family Theileriidae. C. felis 
is an emerging pathogen of domestic cats (Felis catus) in the southern United States, and as such its genome was 
sequenced in an effort to identify potential vaccine targets [93]. With a 9.1 Mbp genome, it has more protein 
coding genes in common with T. parva than it does with B. bovis, and was found to encode a gene that is 
syntenic with a block of genes around the T. parva antigen, and vaccine candidate, p67 [93]. 


There are currently no licensed vaccines against apicomplexans for use in humans, although the RTS,S malaria 
vaccine is close to licensure. With a few notable exceptions, such as coccidiosis (Eimeria), toxoplasmosis 
(Toxoplasma), and East Coast Fever (Theileria parva) vaccines, very few vaccines against piroplasms have been 
used on a commercial scale, which may be due, in part, to antigenic diversity in these parasites [107]. Genomic 
resources have also recently started to become available for some piroplasms (Table 1.7). These data are critical 
for identifying potential virulence genes, mapping recombination hotspots, and estimating genome-wide 
variation among various isolates, including vaccine strains [105]. One weakness of piroplasm whole-genome 
datasets is their reliance on ab initio gene predictors for the majority of their structural annotations (determining 
where exons start and end in the genome). Given the fact that these genomes are smaller, denser, and more AT- 
rich than most eukaryotes sequenced to date, these gene predictors may not be optimal for gene prediction in 
these genomes, and experimental evidence should be rigorously incorporated into genome re-annotation efforts 
in order to take full advantage of the genome sequences that are present for these apicomplexans. The coupling 
of whole-genome variation data with gene expression data is a powerful method to give insight into gene 
structure, variation and function, and will hopefully assist the design of better prophylaxis against piroplasm- 
mediated diseases. 
Table 1.5. The first publication of available piroplasm whole genome sequences, and a few features of their genomes. All of these 
are available at PiroplasmaDB, with the exception of B. divergens. 


Genus       Species     Strain(s)                Year published  Reference  Hosts            Assembly length (Mbp)  %GC   Protein-encoding genes
Babesia     bigemina    BOND*, PR, BbiS3P, JG29  2014            [91]       Bovids           13.8                   51    4,457
Babesia     bovis       T2Bo                     2007            [92]       Bovids           8.2                    41.8  3,671
Babesia     divergens   [illegible]              2014            [91]       Bovids           9.6                    42    4,134
Babesia     microti     R1                       2012            [89]       Rodents, Humans  6.5                    36    3,513
Cytauxzoon  felis       Winnie                   2013            [93]       Felids           9.1                    31.8  4,323
Theileria   annulata    Ankara                   2005            [94]       Bovids           8.4                    32.5  3,792
Theileria   equi        WA                       2012            [95]       Equids           11.6                   39.5  5,330
Theileria   parva       Muguga                   2005            [90]       Bovids           8.3                    34.1  4,035
Theileria   orientalis  Shintoku                 2012            [96]       Bovids           9                      41.6  3,995

* = genomic statistics shown for this isolate; %GC = percentage GC content for the whole genome. 


Table 1.6. Whole-genome data for several piroplasm species. These resources are not available at PiroplasmaDB, but can be found 
associated with their respective references. 


Genus      Species    Strains                                         Year published  Data type                 Reference
Babesia    bovis      C9.1                                            2014            WGS                       [91]
Babesia    divergens  None indicated                                  2014            WGS, draft assembly       [100]
Babesia    microti    R1, Gray                                        2013            Complete genome assembly  [101]
Babesia    bovis      T2Bo_Vir., T2Bo_Att., L17_Vir., L17_Att.,       2011            WGS                       [102]
                      T_Vir., T_Att.
Theileria  parva      Marikebuni, Uganda, MugugaMarikebuni,           2012            WGS, draft assemblies     [103]
                      MugugaUganda
Theileria  parva      ChitongoZ2, KateteB2, Kiambu Z464/C12,          2013            WGS                       [104]
                      MandaliZ22H10, Entebbe, Nyakizu, Katumba,
                      Buffalo LAWR, Buffalo Z5E5
Theileria  parva      Muguga, Kiambus, Serengeti-transformed          2015            WGS                       [105]


Table 1.7. Gene expression data not found at PiroplasmaDB for several piroplasm species. Most expression data, including more 
EST data, is found at PiroplasmaDB for piroplasms. 


Genus       Species   Strains                                    Year published  Reference  Data type
Babesia     bovis     T2Bo                                       2007            [108]      Microarray
Babesia     bovis     T2Bo                                       2013            [109]      RNAseq
Babesia     bovis     T2Bo_Vir., T2Bo_Att., L17_Vir., L17_Att.,  2013            [109]      Microarray, RNAseq
                      T_Vir., T_Att.
Babesia     bigemina  PR                                         2014            [91]       LC-MS
Cytauxzoon  felis     Winnie                                     2013            [93]       EST
Theileria   annulata  Ankara                                     2012            [110]      Microarray
Theileria   annulata  Ankara                                     2013            [111]      Microarray
Theileria   annulata  Ankara                                     2013            [112]      LC-MS/MS
Theileria   parva     Muguga                                     2005            [113]      MPSS

Plasmodium reference genomes 


To date, several complete reference genomes of Plasmodium, the aetiological agent of malaria, have been 
sequenced. Advances in technology have also led to the sequencing of many additional lab strains and clinical 
isolates. The first reference genome to be published, in 2002, was that of P. falciparum 3D7 [12], the species 
responsible for the majority of human morbidity. Additional genomes of species that infect humans have been 
sequenced (P. vivax [114]) or are in the process of being sequenced and analysed (P. malariae and P. ovale). The 
simian- and human-infecting P. knowlesi [115], the chimpanzee malaria parasite P. reichenowi [116] and the simian 
malaria parasite P. cynomolgi [117] are also part of the reference genome collection. Draft genomes of three rodent 
malaria parasites that are widely used as model systems, P. yoelii yoelii [118], P. chabaudi chabaudi AS and 
P. berghei ANKA, were initially sequenced and analysed in 2005 [119]. Due to the highly fragmented nature of these 
genomes, they were re-sequenced in 2014 [120]. Two avian malaria genomes, P. relictum and P. gallinaceum, have been 
sequenced and are in the process of being analysed. They will provide a valuable missing link to understand the 
evolutionary context of human malaria. All of the published genomes mentioned above can be searched in PlasmoDB [121] 
and GeneDB [21]. 


The publication of P. falciparum 3D7 in 2002 was a major milestone [12]. It enabled the malaria community to 
systematically analyse the gene content and tailor their experiments based on genomic data. Its impact is reflected 
in the more than 2000 citations of the genome paper since publication. After the initial publication, the assembly 
and annotation of the P. falciparum 3D7 genome have been continuously improved. In 2011 a new P. falciparum 3D7 
assembly (version 3) was made publicly available. This new version includes the correction of 
major mis-assemblies. The current genome version has a size of 23.3 Mb and encodes 5429 genes (Table 1.8). It is 
highly AT-rich with a GC-content of only 19.3%. The overall structure of Plasmodium genomes sequenced to 
date is very similar (Table 1.8). The nuclear genome consists of 14 chromosomes and ranges in size from 19 Mb to 
26 Mb, with a comparable number of genes across species. About three quarters of genes are conserved across all 
Plasmodium genomes, representing the core genome. Plasmodium genomes also exhibit a high degree of synteny. The 
majority of the variation between Plasmodium species is found in the subtelomeric regions at the ends of the 
chromosomes. In these regions, each of the Plasmodium species has a unique set of gene families that are often 
involved in immune evasion and virulence. The most important gene family in P. falciparum 3D7 is the var 
gene family that encodes the erythrocyte membrane protein 1 (PfEMP1). PfEMP1 plays a role in antigenic 
variation. Of around 60 gene family members, only one protein is expressed on the surface of infected red blood 
cells at a time. PfEMP1 can also bind to host endothelial receptors and therefore plays an important role in 
pathogenicity. Additional gene families include rifins and stevors. It has recently been shown that rifins are 
expressed on the surface of infected red blood cells, where they mediate microvascular binding of infected red 
blood cells [122]. The function of stevors is unknown. Both rifins and stevors belong to the PIR (Plasmodium 
interspersed repeats) superfamily. This superfamily is the only subtelomeric gene family found so far that is 
present in all of the Plasmodium species. 


Closely related to P. falciparum is the chimpanzee malaria parasite P. reichenowi. A comparative genomics 
analysis showed only minor differences between these two genomes. There is an almost complete co-linearity in 
the core areas of the genome. The organisation of var genes and other virulence-associated genes is also 
conserved. Differences were found in the reticulocyte-binding proteins, a gene family involved in invasion. These 
genes encode ligands that are important for the recognition of host erythrocytes. Members of this gene family 
are located on chromosome 13, where two almost identical genes are present in P. falciparum (RH2a and RH2b). 
P. reichenowi lacks RH2a, but encodes a new reticulocyte-binding protein, RH7. The most significant difference 
between P. reichenowi and P. falciparum was found in the rifin and stevor multigene families. There are currently 
463 rifins and 66 stevors annotated in P. reichenowi, while P. falciparum encodes only 185 rifins. The difference in 
these multigene families also explains the difference in the overall number of genes found in the nuclear genome 
(Table 1.8). 


P. vivax is the major source of human malaria outside of Africa. In contrast to P. falciparum, this species has a 
dormant stage in the human liver and can stay inactive for years. The nuclear genome of the Salvador I strain of P. 
vivax has a size of 26.8 Mb and encodes 5433 genes (Table 1.8). With a GC-content of 42.3%, P. vivax has the 
highest GC-content found so far in Plasmodium. Unique to P. vivax is an isochore structure: chromosomes have 
AT-rich chromosome ends and internal regions of high GC-content. 
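
The notion of GC-content, and of an isochore-like structure with AT-rich ends and a GC-rich interior, can be illustrated with a sliding-window calculation. The following is a minimal sketch on a made-up toy sequence; the function names, window size and step are illustrative assumptions and do not correspond to the parameters used in any of the genome analyses cited here.

```python
def gc_content(seq):
    """Return the GC-content of a DNA sequence as a percentage.

    Only unambiguous bases (A, C, G, T) are counted; N and other
    symbols are ignored. Returns 0.0 for a sequence with no such bases.
    """
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def windowed_gc(seq, window, step):
    """Return (start, GC%) pairs for windows of fixed size along seq."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Toy "chromosome": AT-rich ends flanking a GC-rich interior.
    toy = "AT" * 50 + "GC" * 100 + "AT" * 50
    print(f"Overall GC-content: {gc_content(toy):.1f}%")
    for start, gc in windowed_gc(toy, window=100, step=100):
        print(f"window starting at {start}: {gc:.1f}% GC")
```

On real chromosomes the window would typically span thousands of bases, and the profile would be plotted against chromosome position rather than printed.
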


Closely related to P. vivax is the malaria parasite P. knowlesi. P. knowlesi is primarily a simian-infecting malaria 
parasite, but has also been reported to cause natural infections in humans, mainly in South East Asia. The nuclear 
genome has a size of 24.4 Mb, a GC-content of 38.6% and 5290 protein-coding genes (Table 1.8). 
There are two novel features in the P. knowlesi genome. First, the major variant gene families, which are usually 
located in subtelomeres, are found in chromosome-internal regions dispersed on all 14 chromosomes. These regions are 
often also associated with intrachromosomal telomeric repeats. The second unusual feature, unique to P. knowlesi, is 
a phenomenon called molecular mimicry. KIR proteins that are part of the PIR superfamily contain stretches of 
sequence that are identical to the host proteins AHNAK and CD99, the latter of which has a critical immunoregulatory 
role in host T-cell function. It is speculated that these proteins might interfere with host recognition processes. 
Another important gene family is the SICAvar (schizont infected cell agglutination) gene family. SICAvars are 
expressed on the surface of infected erythrocytes and are the largest family of variable surface antigens in P. 
knowlesi. 


Phylogenetically related to P. knowlesi and P. vivax is the simian malaria parasite P. cynomolgi. P. cynomolgi is 
used as a model organism for human P. vivax infections. Both share the ability to form a dormant liver stage. 
The genome of strain B of P. cynomolgi was sequenced and published in 2012 [117]. The genome has a size of 26.2 Mb and 
encodes 5722 genes (Table 1.8). Of those, around 90% have 1:1 orthologs in P. vivax and P. knowlesi. P. cynomolgi 
and P. vivax share a common isochore structure, while the presence of intrachromosomal telomeric repeats is 
common to P. cynomolgi and P. knowlesi. Comparative genome analysis found a number of copy-number 
variants in multigene families, e.g. in reticulocyte-binding proteins. 


Of particular interest are the rodent malaria parasites, P. berghei, P. chabaudi chabaudi and P. yoelii yoelii. They 
are used as model organisms for experimental studies of human malaria. The genome size of the rodent malaria 
parasite genomes ranges from 18.8 Mb to 22.7 Mb (Table 1.8). The GC-content is around 22%. P. yoelii yoelii has 
the highest number of genes in the nuclear genome, mostly due to a large expansion of PIR genes (980). Gene 
synteny is conserved along the 14 chromosomes, with only one known synteny breakpoint. Analysis of gene 
families in the rodent-infective species reveals that the largest gene family is the PIR gene family [120]. The second 
largest gene family encodes fam-a proteins. Fam-a proteins are exported to the infected red blood cell and are 
expanded in the rodent malaria parasites, where the number of genes ranges from 161 in P. yoelii yoelii, to 148 in 
P. chabaudi chabaudi and 74 in P. berghei. All other Plasmodium genomes sequenced to date have only one fam-a 
family gene. 
Table 1.8. Plasmodium reference genomes. 


                           P. falciparum  P. reichenowi  P. vivax        P. knowlesi  P. cynomolgi  P. berghei     P. chabaudi  P. yoelii yoelii
                           3D7 (v3) (3)   CDC (3)        Salvador I (1)  H (3)        B (2)         ANKA (v3) (3)  AS (v3) (3)  17X (v3) (3)
Genome size (Mb)           23.2           24             26.8            24.4         26.2          18.8           18.9         22.7
No. of chromosomes         14             14             14              14           14            14             14           14
G+C content (%)            19.3           19.2           42.3            38.6         40.4          22             23.6         21.5
No. of unassigned contigs  0              237            2745            148          1649          5              0            138
No. of genes (4)           5429           5736           5433            5290         5722          5034           5183         5948
% of genes with introns    54.1           55.9           52.1            54           75.8          52.4           53.5         59.8
No. of PIRs (5)            227            529            346             70           256           217            208          980
Manually curated           yes            yes            no              yes          no            yes            yes          yes

(1) Carlton et al., Nature 455, 757-63 (2008) 
(2) Tachibana et al., Nat Genet. 44, 1051-5 (2012) 
(3) genome version from 1.10.2015 
(4) including pseudogenes and partial genes, excluding non-coding RNA genes 
(5) including pseudogenes and partial genes 

Trypanosomatids 


Trypanosomatids are a group of parasitic unicellular flagellate eukaryotes. Their range of hosts is diverse and 
includes humans as well as a wide variety of species from both the animal and plant kingdoms. 
Trypanosomatids belong to the Kinetoplastida, which is included in the phylum Euglenozoa, a branch that 
diverged early in the eukaryotic tree [123, 124]. While a number of Kinetoplastida are pathogenic parasites, most 
are free-living organisms found in soils and aquatic habitats. The name Kinetoplastida derives from the presence 
of large amounts of mitochondrial DNA, visible by light microscopy as a dense mass known as the kinetoplast, 
with its contained DNA referred to as kDNA. Trypanosomatids are obligate parasites that can be monoxenous or 
dixenous (usually alternating between an insect vector and another animal or a plant) [125]. 


Trypanosomatid Genomes 


The nuclear genome of trypanosomatids has some unusual characteristics when compared with other eukaryotic 
genomes. Their genome is organized in polycistronic transcriptional units (PTUs) and the production of 
individual mRNAs from PTUs requires trans-splicing of a splice leader (SL) sequence [126]. PTUs are well 
conserved and exhibit a high degree of synteny between species. The kDNA has an unusual physical structure, 
being arranged in circles of DNA that are interlocked in a chain-mail-like network. Mitochondrial mRNAs 
require post-processing in the form of insertion and deletion of uridines before being translated into proteins, a 
process known as RNA editing [127, 128]. Other peculiarities of trypanosomatid genomes include the almost 
complete lack of introns, kinetoplastid-specific histone modifications and histone variants, unique origins of 
replication in some genera, a special DNA base (Base J) [129], and the transcription of protein-coding genes by 
RNA pol I in African trypanosomes, a behavior unique among eukaryotes [130]. None of these unusual features 
seems to be exclusive to trypanosomatids, as they are also present, at least in some basic form, in 
other free-living kinetoplastids. Nevertheless, they may be related to the development of parasitism in 
trypanosomatids [124, 131]. 


Regulation of Gene Expression in Polycistronic Transcriptional Units 


One of the most striking characteristics of trypanosomatid genomes is the organization of their protein-coding 
genes into long polycistronic transcriptional units (PTUs) that contain tens to hundreds of genes in the same 
orientation. Individual mRNAs are produced from the precursor mRNA by the 5' trans-splicing of a capped 
mini-exon or splice leader sequence (SL), followed by the polyadenylation of the 3' end. The 5' trans-splicing is 
linked to the polyadenylation of the upstream gene. Gene order within PTUs is highly conserved among 
trypanosomatids and the main differences are usually in the regions between the PTUs and at the ends of the 
chromosomes [126, 132]. 
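
The idea of genes arranged in long same-orientation blocks can be sketched computationally. The example below assumes a deliberately simplified model in which a candidate PTU is a maximal run of consecutive genes on the same strand, so that boundaries fall at strand-switch regions. This is an approximation for illustration only: real transcription units may also start or end without a change of strand, and the gene identifiers, coordinates and function name are hypothetical.

```python
from itertools import groupby


def group_into_candidate_ptus(genes):
    """Group genes on one chromosome into same-strand runs.

    genes: iterable of (gene_id, start_coordinate, strand) tuples,
           with strand given as "+" or "-".
    Returns a list of (strand, [gene_id, ...]) in chromosomal order.
    """
    ordered = sorted(genes, key=lambda gene: gene[1])
    return [
        (strand, [gene[0] for gene in run])
        for strand, run in groupby(ordered, key=lambda gene: gene[2])
    ]


if __name__ == "__main__":
    # Hypothetical annotation for a toy chromosome.
    toy_genes = [
        ("gene3", 2000, "-"),
        ("gene1", 100, "+"),
        ("gene2", 900, "+"),
        ("gene5", 4000, "+"),
        ("gene4", 3100, "-"),
    ]
    for strand, members in group_into_candidate_ptus(toy_genes):
        print(strand, members)
```

Sorting by start coordinate before grouping matters, because `itertools.groupby` only merges adjacent items that share a key.
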


The genes included in a PTU are functionally unrelated and can be expressed at different times of the cell cycle 
or in different life stages. Nonetheless, each PTU is transcribed from a single transcriptional start site (TSS), 
severely limiting the amount of regulation that could be provided by the induction or repression of promoters. In 
some cases, a correlation between the location of a gene in a PTU and its expression level has been described. For 
example, in T. brucei, genes downregulated after heat shock tend to be closer to the TSS, 
while upregulated genes tend to be more distal. Also, the position of the genes along the PTUs correlates 
with gene regulation during the different cell cycle stages. However, most of the genes do not seem to be ordered 
depending on their transcriptional regulation [123, 126, 133]. 


In most organisms, the start of transcription is a fundamental step in the regulation of gene expression. In 
trypanosomatids this layer is constrained, but a swift and specific regulation of gene expression is still needed. 
Dixenous species like T. brucei or L. major have complex life cycles that require fast and extensive changes in 
morphology and metabolism. These changes depend, ultimately, on changes in gene expression. For example, 
the parasite has to quickly adapt to differences in temperature, energy sources and host immune system [130, 
133]. Besides the regulation at the start of transcription, it is possible to modulate other steps in the transcription 
and translation process. Additional levels of control include transcriptional elongation, mRNA processing 
(trans-splicing and polyadenylation), export from the nucleus, mRNA degradation (in the cytoplasm and 
nucleus), translation (start and elongation) and protein degradation [126, 132]. 


Both mRNA processing and the control of mRNA stability are important regulatory steps in 
trypanosomatids. The stability of the mRNAs depends on elements present in the 3’ UTRs; for instance, 
duplicated genes in tandem arrays can be differentially regulated due to differences in their 3’ UTRs. In T. brucei, 
the range of half-lives of mature mRNAs is very diverse and is also determined by the life-cycle stage. In addition, 
the half-life of an mRNA depends not only on the stability of the mature mRNA but also on the rate of 
destruction of the precursor mRNA. If an mRNA undergoes late or delayed polyadenylation, it is more 
susceptible to being degraded, even before finishing maturation [132, 134]. 


Trypanosomatids contain a large number of RNA binding proteins (RBPs) that likely regulate expression levels 
by binding to regulatory elements in the 3’ UTRs of the mRNAs. The number of RBPs is high compared with the 
number of mRNAs. Consequently, the current hypothesis proposes the binding of multiple RBPs to each 3’ UTR, 
which would compete or cooperate dynamically with other RBPs. The mix of RBPs would determine the stability 
of the mRNA and could also modulate the translation process [132, 135]. The expression of protein-coding 
genes can also be regulated at the translational level. Ribosome profiling studies have shown that there is 
a wide range in the density of ribosomes associated with mRNAs, with differences between life stages. In addition, 
trypanosome mRNAs can contain upstream open reading frames in their 5’ UTRs, which decrease the 
translation of the main ORF [132, 136, 137]. 
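
As an illustration of what an upstream open reading frame is, the sketch below scans a 5’ UTR sequence for ATG codons that are followed by an in-frame stop codon before the UTR ends. It is a simplified, hypothetical example: it works on the DNA alphabet, ignores start-codon context, skips upstream ORFs that run past the end of the UTR into the main ORF, and is not the method used in the ribosome profiling studies cited above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(utr5):
    """Find upstream ORFs fully contained in a 5' UTR.

    utr5: 5' UTR sequence in the DNA alphabet (ATG as start codon).
    Returns a list of (start, end) pairs, 0-based and end-exclusive,
    where each span runs from an ATG to its first in-frame stop codon.
    """
    utr5 = utr5.upper()
    uorfs = []
    for start in range(len(utr5) - 2):
        if utr5[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this ATG.
        for pos in range(start + 3, len(utr5) - 2, 3):
            if utr5[pos:pos + 3] in STOP_CODONS:
                uorfs.append((start, pos + 3))
                break
    return uorfs


if __name__ == "__main__":
    toy_utr = "CCATGAAATAGGG"  # made-up sequence: ATG AAA TAG embedded in a UTR
    for start, end in find_uorfs(toy_utr):
        print(start, end, toy_utr[start:end])
```
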
Multi-Copy Families of Surface Proteins 


Genome reduction is frequent in parasites, because functions that are essential for a free-living organism become 
obsolete inside a host. Surprisingly, compared with other single-celled parasitic eukaryotes, trypanosomatid 
genomes do not appear to be especially reduced in size or function. On the contrary, in the evolution of 
parasitism in trypanosomatids the gain of new competences seems to have been more important than the loss of 
functions [134]. One example of this gain of functions is the presence of large multi-copy families that encode 
surface proteins. These families are specific to trypanosomatids and usually have a non-random distribution in 
the genome. A number of them have been implicated in pathogenesis and defence against the host immune 
system, such as the Major Surface Protease (MSP) family of metalloproteases involved in pathogenesis and 
conserved in all trypanosomatids. Other well-known examples are the Variant Surface Glycoprotein (VSG) and 
procyclin in T. brucei, delta-amastin and Promastigote Surface Antigen (PSA) in Leishmania and trans-sialidases 
in T. cruzi [134, 138, 139]. 


Epigenetic regulation 


In eukaryotes, nuclear DNA is organized into a complex of DNA and proteins known as chromatin. The 
nucleosome is the basic unit of chromatin, providing a sevenfold condensation. It comprises an octamer 
made of two copies of each of the core histones (H2A, H2B, H3 and H4) around which approximately 147 bp of 
DNA are wrapped. In addition, a linker histone (H1) binds the DNA region between two nucleosomes and helps 
stabilize the chromatin. The chromatin is folded into a 30 nm chromatin fiber that can be further compacted, up 
to the level of the distinct chromosomes that can be visualized during mitosis [126, 129]. 
Although nucleosomes are still the basic unit of chromatin in trypanosomatids, their histones are divergent 
from those found in yeast and vertebrates. DNA in trypanosomatids is not condensed into the 30 nm chromatin 
fiber, nor do chromosomes condense during mitosis. However, some differences in the level of condensation 
between life-cycle stages have been described [126, 130]. 


Mechanisms that influence the structure of chromatin have been implicated in the regulation of gene expression. 
In trypanosomatids, as in other eukaryotes, specific modifications of the N-terminal tails of histones, or the 
presence of histone variants, correlate with regions of active or repressed transcription. As yet, no conserved 
sequences have been identified at the transcription start sites (TSSs) of the PTUs. It has been proposed that TSSs 
could be determined by chromatin structure rather than by the presence of conserved sequence motifs. Some of the 
histone modifications described in trypanosomatids are common in eukaryotes, but there are also some 
modifications and histone variants specific to trypanosomatids, such as H3V and H4V (probable markers of 
transcription termination sites) [129, 140]. 


Mitochondrial Genome: Architecture and RNA Editing 


The kDNA is made up of circles of DNA that are interlocked in a chain-mail-like network and are of two types: 
maxicircles and minicircles. Maxicircles encode classical mitochondrial genes, but 
their transcripts require RNA editing, the insertion or deletion of uridines, before being translated. Minicircles 
encode guide RNAs (gRNAs) that act as templates during the editing process. Unlike in other eukaryotes, 
mitochondrial tRNAs are encoded in the nuclear genome and require specific targeting sequences to be transported 
into the mitochondrion [123, 128]. The mitochondrial genome contains a few dozen maxicircles, with identical 
sequence and a size of 20-40 kb, and thousands of minicircles. Minicircles differ in sequence content but their 
size is species-specific and uniform, usually between 0.5 and 10 kb. Maxicircles are concatenated together and 
simultaneously interlinked with the minicircle network. The DNA network and associated proteins are 
organized in a dense disc visible by light microscopy. While all kinetoplastids contain maxicircles and minicircles, the 
concatenated network is unique to trypanosomatids [124, 141]. During RNA editing, uridines are inserted into or 
deleted from mitochondrial mRNAs, fixing errors in the sequence and restoring a viable coding sequence. The 
sequences used as templates are stored in the gRNAs (50-60 nt). Each gRNA encodes only a small portion of the 
information needed to repair an mRNA, so multiple gRNAs are required to edit each mRNA. RNA editing 
is catalysed by the RNA editing core complex, or editosome. Several modules can combine to build different 
versions of the editosome, each with different specificities [127, 142]. 
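
As a toy illustration of the uridine insertion and deletion described above, the following sketch applies a list of editing operations to a pre-edited transcript. The function name, the operation format and the example sequence are hypothetical and chosen only for illustration; real editing is directed by base pairing with gRNAs and catalysed by the editosome, neither of which is modelled here.

```python
from typing import List, Tuple

# Each operation is (position in the pre-edited mRNA, number of uridines).
# A positive count inserts Us before that position; a negative count deletes
# that many Us starting at that position. This format is an illustrative
# assumption, not a representation used by any published tool.
EditOp = Tuple[int, int]


def apply_u_edits(pre_mrna: str, ops: List[EditOp]) -> str:
    """Return the transcript after applying U insertions and deletions."""
    edited = pre_mrna
    # Work from the highest coordinate downwards so that earlier
    # coordinates remain valid after each edit.
    for pos, count in sorted(ops, reverse=True):
        if count > 0:
            edited = edited[:pos] + "U" * count + edited[pos:]
        elif count < 0:
            n = -count
            segment = edited[pos:pos + n]
            if segment != "U" * n:
                raise ValueError(f"cannot delete non-U bases at {pos}: {segment!r}")
            edited = edited[:pos] + edited[pos + n:]
    return edited


if __name__ == "__main__":
    pre = "AGGAUUUGCA"        # hypothetical pre-edited fragment
    ops = [(2, 2), (4, -1)]   # insert UU before index 2, delete one U at index 4
    print(apply_u_edits(pre, ops))
```

In this sketch only uridines can be added or removed, mirroring the restriction of kinetoplastid editing to U residues; an attempt to delete any other base raises an error.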


Base J 


Base J (beta-D-glucosyl-hydroxymethyluracil), a modified form of thymine, is enriched at the ends of PTUs, 
at potential transcription termination sites (TTSs) and in repetitive DNA elements, such as the telomeric repeats 
[123, 129]. 


Transposable Elements 


The two main classes of transposable element are DNA transposons and RNA retrotransposons. DNA 
transposons move by “cut and paste” and depend on a DNA intermediate, while RNA retrotransposons use a 
“copy and paste” strategy, with an RNA intermediate. DNA transposons have not been found in trypanosomatid 
genomes, but RNA retrotransposons have been shown to be present. For example, several classes of potentially 
active retrotransposons have been identified in T. brucei and T. cruzi, some of which could be involved in the 
regulation of gene expression, such as SIDER2, which localizes to the 3'UTRs of mRNAs and affects their stability 
[126, 143, 144]. 


Sequenced Genomes 


The first trypanosomatids sequenced were Trypanosoma brucei, Trypanosoma cruzi and Leishmania major, the 
causative agents of sleeping sickness, Chagas disease and leishmaniasis in humans [145-147]. Since then, the 
genomes of other medically relevant trypanosomatids have been published. Leishmania species that have been 
sequenced include L. donovani [148], L. infantum, L. braziliensis [149], L. mexicana [138], L. panamensis [150], 
L. peruviana [151] and L. amazonensis [152]. Trypanosoma species include T. rangeli [153]. Apart from the reference 
genomes, multiple strains and hundreds of isolates have been sequenced and are available in the databases 
(TriTrypDB, NCBI). The range of published genomes has expanded to other dixenous species and includes 
parasites of reptiles (Trypanosoma grayi [154] and Leishmania tarentolae [155]), parasites of livestock (T. evansi 
[156]) or parasites of plants (Phytomonas serpens, Phytomonas spp. [157, 158]). In addition, the genomes of a few 
monoxenous trypanosomatids have been published (Leptomonas seymouri [159] and Lotmaria passim [160]). 
Some of these species harbour symbiotic bacteria and have been used as a model to study the evolution of 
organelles (Crithidia acanthocephali, Herpetomonas muscarum, Strigomonas oncopelti, Strigomonas galati, 
Strigomonas culicis, Angomonas desouzai and Angomonas deanei) [161]. Additional genomes available 
pre-publication in the genome databases (TriTrypDB, NCBI) include Endotrypanum monterogeii, Leptomonas 
pyrrhocoris and Crithidia fasciculata; the Leishmania species L. aethiopica, L. tropica, L. gerbilli, L. enriettii and 
L. turanica; and the Trypanosoma species T. congolense and T. vivax. 


Toxoplasma and related organisms 


Toxoplasma gondii is a member of the tissue cyst-forming coccidian parasites, which include Neospora caninum, 
Hammondia hammondi and Sarcocystis spp., among others [162-165]. Of these, T. gondii appears to be the most 
widely distributed both geographically and by host diversity, and is able to infect virtually any warm-blooded 
animal. While the diversity of T. gondii is restricted to three clonal lineages in Europe and North America, 
isolates from the southern hemisphere exhibit much wider genetic variability [166]. Remarkably, while T. gondii 
can infect a wide variety of warm-blooded organisms, it can only undergo sexual recombination in Felidae. Cats 
shed environmentally resistant oocysts containing infective sporozoites, which can be transmitted orally to other 
organisms such as rodents [165]. Following oral infection, sporozoites cross the small intestine and can infect a 
variety of cells where they undergo a developmental switch to fast-growing tachyzoites [166]. Tachyzoites 
replicate through a process called endodyogeny where two daughter cells are formed within a mother cell by a 
combination of de novo building of cytoskeletal and secretory components, replication and segregation of 
mother cell components (i.e. nucleus, mitochondria and apicoplast) and recycling of mother cell components 
[167, 168]. Pressure from the host immune system forces tachyzoites to undergo another developmental change 
into bradyzoites [169]. These semi-quiescent cells form clusters called tissue cysts that settle in brain and/or 
muscle tissue where they may remain for the life of the host, although reactivation of bradyzoites can occur in 
immunocompromised individuals. Bradyzoites also serve as a reservoir of transmission if an infected host is 
eaten by another animal. Interestingly, the tissue cyst tropism varies markedly between hosts. The fast replicating 
tachyzoite stage is often asymptomatic, but can cause acute morbidity or mortality in immunocompromised 
individuals. Placental transmission is known to cause foetal mortality or serious congenital defects. 


T. gondii contains a ~65 Mb nuclear genome comprising 14 chromosomes [170-172], a 35 kb apicoplast genome 
[173] and a mitochondrial genome. T. gondii genomic scale data such as expressed sequence tags, sequenced 
BAC clones and whole genome shotgun sequencing were first made available through ToxoDB beginning in 
2001 [174]. Since then, additional genomic scale data have been generated, including genome sequence and 
transcriptomic data from a large-scale population sequencing project [172]. The genomes of the closely related H. 
hammondi and N. caninum are ~65 Mb and ~62 Mb in size, respectively, and, not surprisingly, also comprise 14 
chromosomes each (Table 1.9) [171, 172, 175]. The genome of the more divergent S. neurona is almost twice the 
size of those previously described at ~130 Mb, while a GC content of roughly 53% is common across this group 
(Table 1.9) [176]. A high degree of genomic synteny is observed between T. gondii, H. hammondi and N. 
caninum. This level of synteny is not maintained between this group and S. neurona [171, 172, 176]. 


Apicomplexan parasites in general have evolved secretory systems that transport effector molecules into their 
host cells. These have a range of functions, including modification of the intracellular environment, promotion of 
immune evasion and modulation of host-cell transcription [177]. Most information about secretory effectors in 
coccidian parasites comes from T. gondii, where numerous studies have defined dense granule [178], rhoptry 
[179], microneme [180] and SAG1-related sequence (SRS) proteins [181]. Comparative genomic analysis 
revealed that one of the primary features differentiating both different species of coccidian parasite and different 
strains of T. gondii is sequence diversity and copy number variation (CNV) at secretory effector loci [171, 172, 
175]. A comparison of 62 isolates of T. gondii and one isolate of H. hammondi showed that secretory effectors 
are often found in genomic regions exhibiting tandem amplification [172]. A comparison of reference isolates 
from the 16 major Toxoplasma haplogroups showed that all possess a repertoire of secretory effectors, with most 
diversity occurring in rhoptry and SRS genes. Further comparison of secretory effectors between T. gondii, H. 
hammondi and N. caninum revealed additional diversity and a T. gondii-specific family (TgFAMs) of effectors, 
which may be important for host range and definitive host preferences [172]. Interestingly, a number of the 
TgFAMs are clustered in telomeric regions and contain a variable region, which may implicate them in immune 
evasion [172, 182], but they may also play a role during sexual development since many are expressed in the 
cat and in oocysts [183]. 


Table 1.9. Basic genome statistics for T. gondii and related organisms. 

                          Toxoplasma    Hammondia          Neospora            Sarcocystis 
                          gondii*       hammondi H.H.34    caninum Liverpool   neurona** 
Genome size (Mb)          63            65                 62                  128 
No. of chromosomes        14            14                 14                  ND 
No. of genes              8707          8176               7266                7140 
% of genes with introns   76            76                 77                  81 

* Average statistics from three strains: ME49, VEG and GT1 
** Average statistics from two strains: SN1 and SN3 
ND = not determined 
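
The figures in Table 1.9 can be combined into a rough gene density for each genome. The short sketch below does this arithmetic; the values are copied from the table, while the dictionary layout and the function name are illustrative assumptions.

```python
# Values copied from Table 1.9: (genome size in Mb, number of genes).
GENOMES = {
    "Toxoplasma gondii": (63, 8707),
    "Hammondia hammondi H.H.34": (65, 8176),
    "Neospora caninum Liverpool": (62, 7266),
    "Sarcocystis neurona": (128, 7140),
}


def genes_per_mb(size_mb: float, n_genes: int) -> float:
    """Average number of annotated genes per megabase of genome."""
    return n_genes / size_mb


if __name__ == "__main__":
    for name, (size_mb, n_genes) in GENOMES.items():
        print(f"{name}: {genes_per_mb(size_mb, n_genes):.1f} genes/Mb")
```

On these numbers the S. neurona genome carries fewer than half as many genes per megabase as the other three, which is consistent with a genome roughly twice the size holding a similar number of genes.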


Data integration and accessibility 


Several databases exist that provide free online access to parasite genomes, annotation and functional data 
(Table 2.0). The National Institute of Allergy and Infectious Diseases (NIAID) in the United States 
established Bioinformatics Resource Centers (BRCs) in 2004, whose goal is to provide the global pathogen 
research community with free online tools to mine genomic and functional genomic data, and additional 
data types essential for pathogen surveillance and control [184]. The BRCs included one specifically tasked with 
providing support for the eukaryotic pathogen scientific community (EuPathDB, initially known as ApiDB) 
[185]. Now in its third five-year funding cycle, EuPathDB incorporates data from over 240 parasitic and 
evolutionarily related organisms spanning multiple phyla such as the Amoebozoa, Apicomplexa, Euglenozoa, 
Metamonada, Sarcomastigophora and numerous fungal phyla. Data include genome sequence, structural and 
functional annotation, and functional data covering the omics landscape, including transcriptomics, proteomics and 
metabolomics. The most current database content can be accessed here: 
http://eupathdb.org/eupathdb/eupathGenome.jsp 


Data within EuPathDB and its component sites are searchable via an intuitive graphical user interface that allows 
the development of complex in silico experiments to support hypothesis-driven experiments. Data types include 
the underlying genomic sequences and annotations (close to 250 genomes represented), transcript-level data 
(SAGE-tag, EST, microarray and RNA sequence data), protein expression data (including quantitative data), 
epigenomic data (ChIP-chip and ChIP-seq), population-level (SNP) and isolate data, and host response data 
(antibody array). In addition, genomic analyses provide the ability to search for gene features, subcellular 
localization, motifs (InterPro and user-defined), function (Enzyme Commission annotation and GO terms) and 
evolutionary relationships based on gene orthology. Detailed usage instructions are available 
through publications and online tutorials and exercises [121, 186]. A number of YouTube tutorials are available: 
https://www.youtube.com/user/EuPathDB/. EuPathDB resources also support community annotation and curation: 
user comments (including images, files, PubMed records, etc.) can be added to records in EuPathDB sites 
and become immediately visible and searchable. A graphical search system allows complex 
searches to be built in a step-wise manner and then saved, modified and shared. An example strategy can be seen in 
Fig. 1 and accessed online by following this link: http://plasmodb.org/plasmo/im.do?s=df42a71ae3acbb1e. 
A genome browser integrating genomes, annotation, analyses and functional data provides browsing capability. 
Column and results analysis tools are also available to generate word cloud graphics, histograms, and GO term 
and pathway enrichment analyses. 
Fig. 1. Screenshot from PlasmoDB depicting a search strategy that identifies putative phosphatases that are secreted and 
expressed in gametocytes based on proteomics, RNA sequence and microarray experiments. Search strategies are constructed by 
adding steps that query underlying data. Step 1 identifies all putative phosphatases based on a text search. Step 2 identifies any 
of the genes in step 1 that also have a secretory signal peptide, at least one transmembrane domain or both (see expanded view of 
“Secreted”). Step 3 identifies any genes in step 2 that have evidence of expression based on data from three experiments in P. 
falciparum (see expanded view of “Gametocytes”) [120] (Florens et al., 2002; Silvestrini et al., 2010). Step 4 transforms the 
results of step 3 to all orthologs in PlasmoDB. 
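
The step-wise logic of the strategy in Fig. 1 amounts to set operations on gene identifiers. The sketch below mimics it with plain Python sets. It does not call any EuPathDB interface: the gene identifiers, the ortholog table and the function name are hypothetical, and treating the three expression experiments as one precomputed set is a simplifying assumption.

```python
from typing import Dict, Set


def run_strategy(
    phosphatases: Set[str],
    signal_peptide: Set[str],
    transmembrane: Set[str],
    gametocyte_expressed: Set[str],
    orthologs: Dict[str, Set[str]],
) -> Set[str]:
    """Combine gene sets in the same order as the four steps of Fig. 1."""
    # Step 1: genes returned by the text search for phosphatases.
    step1 = phosphatases
    # Step 2: keep genes with a signal peptide, a transmembrane domain or both.
    step2 = step1 & (signal_peptide | transmembrane)
    # Step 3: keep genes with evidence of expression in gametocytes.
    step3 = step2 & gametocyte_expressed
    # Step 4: replace each remaining gene with its orthologs.
    step4: Set[str] = set()
    for gene in step3:
        step4 |= orthologs.get(gene, set())
    return step4


if __name__ == "__main__":
    result = run_strategy(
        phosphatases={"gene_A", "gene_B", "gene_C"},
        signal_peptide={"gene_A"},
        transmembrane={"gene_B", "gene_D"},
        gametocyte_expressed={"gene_B", "gene_C"},
        orthologs={"gene_B": {"ortho_1", "ortho_2"}, "gene_A": {"ortho_3"}},
    )
    print(sorted(result))
```

Each step narrows or transforms the result of the previous one, which is why a saved strategy can be modified by changing a single step without rebuilding the rest.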


Table 2.0. Online resources for genomic scale data. 

Resource name: National Center for Biotechnology Information 
Acronym: NCBI 
Content and functionality: Data repository and search capability (International Nucleotide Sequence Database 
Collaboration) 
Web address (URL): http://www.ncbi.nlm.nih.gov 

Resource name: The European Bioinformatics Institute 
Acronym: EMBL-EBI 
Content and functionality: Data repository and search capability (International Nucleotide Sequence Database 
Collaboration) 
Web address (URL): http://www.ebi.ac.uk 

Resource name: DNA Data Bank of Japan 
Acronym: DDBJ 
Content and functionality: Data repository and search capability (International Nucleotide Sequence Database 
Collaboration) 
Web address (URL): http://www.ddbj.nig.ac.jp/ 

Resource name: Ensembl Protists 
Acronym: EnsemblProtists 
Content and functionality: Part of the larger Ensembl Genomes, which is a joint European Bioinformatics Institute 
and Wellcome Trust Sanger Institute project providing Ensembl tools, data visualization, data mining and 
comparative analysis 
Web address (URL): http://protists.ensembl.org/ 

Resource name: GeneDB 
Acronym: GeneDB 
Content and functionality: Core part of the Sanger Institute's Pathogen Genomics initiative. Provides early access 
to the latest sequence data and annotation/curation. In addition, the site includes some basic search functionality 
and genome browsing. 
Web address (URL): http://www.genedb.org/ 

Resource name: The Eukaryotic Pathogen Databases 
Acronym: EuPathDB 
Content and functionality: One of four National Institute of Allergy and Infectious Diseases Bioinformatics 
Resource Centers. Provides integrated search capabilities of genomes and functional data dedicated to eukaryotic 
pathogens (and related organisms). Includes AmoebaDB, FungiDB, GiardiaDB, MicrosporidiaDB, PiroplasmaDB, 
PlasmoDB, ToxoDB, TrichDB, TriTrypDB, OrthoMCL and HostDB. 
Web address (URL): 
http://EuPathDB.org 
http://amoebadb.org 
http://cryptodb.org 
http://fungidb.org 
http://microsporidiadb.org 
http://piroplasmadb.org 
http://plasmodb.org 
http://toxodb.org 
http://trichdb.org 
http://tritrypdb.org 
http://orthomcl.org 
http://hostdb.org 